RSDK-10618: Fix data race when a module crashes during reconfiguration #4975
Conversation
module/modmanager/manager.go
Outdated
```go
// There is a circular dependency that causes a deadlock if a module dies
// while being reconfigured. Break it here by giving up on the restart if we
// cannot lock the manager.
if locked := mgr.mu.TryLock(); !locked {
```
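For readers following along, here is a minimal sketch of the pattern in the (since-outdated) snippet above; the type and function names are stand-ins, not the PR's actual code:

```go
import "sync"

// Simplified stand-in for the modmanager: Reconfigure holds mu while it stops
// a module's process and waits on that process's manage goroutine, which may
// itself be stuck inside this exit handler.
type manager struct {
	mu sync.Mutex
}

// onUnexpectedExit runs on the crashed process's manage goroutine.
func (mgr *manager) onUnexpectedExit(exitCode int) {
	// Blocking on Lock() here can deadlock with Reconfigure, so try the lock
	// and give up on the restart if something else already holds it.
	if !mgr.mu.TryLock() {
		return
	}
	defer mgr.mu.Unlock()

	// ...restart the module process while holding the lock...
}
```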
If I'm reading this right, if a module crashes at the same time the mod manager has the lock for an unrelated reason (perhaps adding a resource to a different module), we'll give up on restarting the module? Is there some other means that would restart it?

And I suppose the problem (based on the comment) with just using `mod.inRecoveryLock` here is that module reconfiguration does not take the in recovery lock, but would need to (for the purpose of serializing with restarts -- which it definitely does from a logic perspective)?
> we'll give up on restarting the module

Correct

> Is there some other means that would restart it?

Not that I'm aware of

> And I suppose the problem (based on the comment) with just using `mod.inRecoveryLock` here is that module reconfiguration does not take the in recovery lock...

Correct, that lock was only ever taken by the restart callback. I actually removed it (and that weird atomic bool) since taking the modmanager lock made it unnecessary.
Ok, this should be good to go now. Sorry for the extra noise that moving
```go
// closed.
// Always mark pendingRemoval even if there is no more work to do because the
// restart handler checks it to avoid restarting a removed module.
mod.pendingRemoval = true
```
Not your code, just scrutinizing this variable now that I know it exists. This variable is also used to "hide" modules that are in shutdown, but otherwise still up. I can't comment off-diff (obligatory: GitHub PR is not a professional CR tool). But in `modManager.getModule`, we have this code:
```go
func (mgr *Manager) getModule(conf resource.Config) (foundMod *module, exists bool) {
	mgr.modules.Range(func(_ string, mod *module) bool {
		var api resource.RPCAPI
		var ok bool
		for a := range mod.handles {
			if a.API == conf.API {
				api = a
				ok = true
				break
			}
		}
		if !ok {
			return true
		}
		for _, model := range mod.handles[api] {
			if conf.Model == model && !mod.pendingRemoval {
				foundMod = mod
				exists = true
				return false
			}
		}
		return true
	})
	return
}
```
Does that set us up for a race when "delayed" removal of module `foo` V0 + adding a new component depending on module `foo` V1?
Most of the mod manager operations are under a global lock (at least the ones related to module process lifetimes). So I suppose we're good. But just asking out loud now that I've been introduced to this nuanced module state.
That's a good point. I don't think this set of changes creates any new races, since it's just setting the `pendingRemoval` field on modules that are being removed from the manager anyway. That said, it looks like a lot of codepaths to `getModule` don't take the lock (I guess because they're just reads?), so there may be more preexisting race conditions there.
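Purely to illustrate the point about unlocked readers, a hypothetical wrapper (not something in this PR) that serializes the lookup with the manager mutex would look roughly like:

```go
// Hypothetical helper: take the manager mutex around the existing lookup so
// reads of mod.pendingRemoval cannot interleave with removal/restart paths.
func (mgr *Manager) getModuleLocked(conf resource.Config) (*module, bool) {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()
	return mgr.getModule(conf)
}
```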
> obligatory: GitHub PR is not a professional CR tool

New 20% project: set up Viam Gerrit instance?
```go
	return
}

// Something else already started a new process while we were waiting on the
```
I don't understand this. I know you were describing how the code likes to re-use the same module object. But I think we'll want something more satisfying than "something else did something we only know how to observe via a `Status` call".

What's the sequence of public API calls + module crashes that leads to this?
There are two cases we need to account for: the module process crashing while being removed, and the module process crashing while being reconfigured. The case this check addresses is the crash during reconfigure. The `modmanager` will kill and restart the module process, so by the time the exit handler has the lock to proceed, a new process may already be running and we need to avoid starting a second one. Here's my best attempt at showing how the goroutines interleave in markdown form:
| modmanager goroutine | module.process.manage goroutine |
|---|---|
| manager.Reconfigure() | &lt;call to cmd.Wait() returns w/ err&gt; |
| &lt;manager.mu locked&gt; | &lt;module.process.mu locked&gt; |
| manager.closeModule() | |
| module.stopProcess() | |
| module.process.Stop() | |
| &lt;block on module.process.mu&gt; | |
| | &lt;reaches OUE handler, module.process.mu unlocked&gt; |
| &lt;module.process.mu unlocked, lock it and complete call to Stop&gt; | manager.newOnUnexpectedExitHandler |
| manager.startModule() | &lt;block on manager lock&gt; |
| &lt;various further calls assign a new running managedProcess to module.process&gt; | |
| &lt;manager.mu unlocked&gt; | |
| | &lt;unblocked, lock manager.mu&gt; |
| | &lt;module.process.Status() == nil, abort restart&gt; |
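The last rows of the table amount to a check like the one sketched below once the handler finally acquires the manager lock. This is a rough illustration with made-up names, assuming Status() reports nil while a process is running; it is not the literal PR code:

```go
import "sync"

// Minimal stand-ins for the real modmanager/pexec types; illustrative only.
type process interface{ Status() error }

type module struct{ process process }

type manager struct{ mu sync.Mutex }

func (mgr *manager) maybeRestart(mod *module) {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()

	// A nil Status() means a live process is already attached to the module:
	// Reconfigure won the race and started a fresh process while this handler
	// was blocked on mgr.mu, so there is nothing left to restart.
	if err := mod.process.Status(); err == nil {
		return
	}

	// Otherwise the old process really is gone; start a replacement here.
}
```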
The first fix I tried was passing the `managedProcess` pointer to the exit handler along with the exit code and then checking whether it was the same as the pointer stored in `module.process`. If the pointers differed, then we lost the race and a new `managedProcess` instance had already been set up on the module, so we should abort the restart. That felt cleaner than checking the current process status, but it required changing a public API in pexec, so I replaced it with the current implementation. I could probably find/create some other unique value that changes each time the module restarts to accomplish the same thing if we don't want to check the process status (a rough sketch of the pointer-comparison idea follows the list below). Top-of-mind are:
- Create a new module struct each time the module is restarted and compare the module pointers.
- Use `ManagedProcess.ID()` to check if the process instance has changed. The docstring on `ManagedProcess.ID()` says it "returns the unique ID of the process", but currently we just set it to the module name, which in this context is not unique. I could add some random or incrementing value to that string to make it unique, but I'm not sure if anything else relies on that ID to map back to module names.
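For concreteness, the pointer-comparison idea mentioned above would look roughly like this (stand-in types, not pexec's actual API):

```go
import "sync"

type managedProcess struct{} // stand-in for pexec.ManagedProcess

type module struct {
	mu      sync.Mutex
	process *managedProcess
}

// makeExitHandler captures the process instance it was created for; the
// handler bails out if module.process has since been replaced by Reconfigure.
func (mod *module) makeExitHandler(owned *managedProcess) func(exitCode int) {
	return func(exitCode int) {
		mod.mu.Lock()
		defer mod.mu.Unlock()
		if mod.process != owned {
			// We lost the race: a new managedProcess is already attached, so
			// starting another process here would duplicate it.
			return
		}
		// ...restart logic for the process this handler actually owns...
	}
}
```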
This all makes sense. Thanks for the detailed writeup. I'm going to leave some remarks re: documentation. But in hindsight, what you have is pretty good already.
I also agree with the assessment re: checking a managed process pointer (or something more logical). That would be preferable, but not preferable enough to justify breaking the API further.
module/modmanager/manager.go
Outdated
```go
// Log error immediately, as this is unexpected behavior.
mod.logger.Errorw(
	"Module has unexpectedly exited.", "module", mod.cfg.Name, "exit_code", exitCode,
)

// Take the lock to avoid racing with calls like Reconfigure that may make
```
Suggested documentation:
Noting that a module process can be restarted while preserving the same `module` object. And consider the case where a module is being restarted due to a configuration change concurrently with the existing module process crashing. There are two acceptable interleavings:
1. The `onUnexpectedExitHandler` restarts the module process with the old configuration.
1a) and the Reconfigure then shuts down + restarts the (freshly launched) module process with one using the updated configuration.
2. Or, the `Reconfigure` executes and starts the module process[1] with the updated config. The `onUnexpectedExitHandler` will still run. But will become a no-op.
For the second scenario, we check our assumptions after acquiring the modmanager mutex. If the module process is running, there is nothing for us to do.
[1] I'm seeing that reconfigure has this code for stopping the existing module process:
```go
if err := mod.stopProcess(); err != nil {
	return errors.WithMessage(err, "error while stopping module "+mod.cfg.Name)
}
```
Have we confirmed that, in our case of racing with a crash, `stopProcess` there will not return an error?
In both outcomes of the race (either Stop or manage getting the process lock first), the call to `managedProcess.Stop()` will return an error reporting that the process it tried to kill did not exist. We whitelist that error and stop propagating it in `module.stopProcess`, so the module stop code will proceed normally.
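To make that concrete, the whitelisting described above has roughly the following shape. This is a sketch, not the actual `module.stopProcess` code, and matching on the os.ErrProcessDone text is an assumption about which error gets swallowed:

```go
import "strings"

// Sketch: treat "the process was already gone" as a successful stop so a
// crash that races with Reconfigure's teardown doesn't surface as an error.
func stopIgnoringAlreadyDead(stop func() error) error {
	if err := stop(); err != nil {
		// Signaling an already-reaped child fails with os.ErrProcessDone,
		// whose text is "os: process already finished".
		if strings.Contains(err.Error(), "process already finished") {
			return nil
		}
		return err
	}
	return nil
}
```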
Optimistically approving. Left a comment that just adds more words to the race your documentation alludes to. Feel free to correct any inaccuracies or rewrite to your tastes.
This has a matching change in viamrobotics/goutils#432 that actually fixes the race condition. Unfortunately, the way `modmanager` is written meant that fixing the race caused a deadlock, which this PR addresses.